1 Introduction

In this analysis, I will build different types of machine learning (ML) models to predict the rental price of apartments in Madrid. The key features used for prediction include the surface area, number of bathrooms, number of rooms, and the neighborhood where the apartment is located.

The data for this analysis was obtained through a web scraping project from the real estate website Fotocasa.com, which is a well-known and widely used platform in Spain for property listings.

2 Libraries

library(dplyr)
library(caret)
library(ModelMatrixModel)
library(mgcv)
library(ggplot2)
library(sf)
library(car)
library(corrplot)
library(MASS)
library(RcmdrMisc)
library(randomForest)
library(gbm)
library(mpae)
library(knitr)
library(pdp)
library(gridExtra)
library(glmnet)

3 Preprocessing the data and Exploratory Data Analysis

data<-read.csv2("/Users/aa/Downloads/apartment_data.csv",sep=",",header=T)
madrid<-data[251:nrow(data),]
summary(madrid)
##     Price           Square.Meters         Rooms           Bathrooms      
##  Length:5489        Min.   :   1.00   Min.   :  1.000   Min.   :  1.000  
##  Class :character   1st Qu.:  45.00   1st Qu.:  1.000   1st Qu.:  1.000  
##  Mode  :character   Median :  70.00   Median :  2.000   Median :  1.000  
##                     Mean   :  85.55   Mean   :  2.195   Mean   :  8.502  
##                     3rd Qu.: 105.00   3rd Qu.:  3.000   3rd Qu.:  2.000  
##                     Max.   :8300.00   Max.   :430.000   Max.   :695.000  
##                     NA's   :763       NA's   :2         NA's   :10       
##      Zona          
##  Length:5489       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
glimpse(madrid)
## Rows: 5,489
## Columns: 5
## $ Price         <chr> "990", "1.400", "1.165", "1.800", "1.110", "1.206", "11.…
## $ Square.Meters <int> 68, 71, 55, 30, 390, 250, 82, 125, 90, 82, 152, 70, 42, …
## $ Rooms         <int> 2, 1, 3, 1, 3, 3, 2, 2, 3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 1,…
## $ Bathrooms     <int> 2, 1, 1, 1, 4, 4, 2, 2, 1, 2, 3, 1, 1, 2, 1, 2, 1, 2, 34…
## $ Zona          <chr> "Tetuán", "Salamanca", "Carabanchel", "Centro", "Caraban…

From the summary function, we can obtain plenty of information from the data. The first step would be to change the variable type for “Price” since it is essential to have it as numeric. Next, it is noticeable that there is a significant amount of NA data, which will be problematic for the subsequent stages of the analysis. Therefore, it is better to remove those rows with NA values from the dataset.

Afterwards, it is essential to analyze the distribution of the variables and their potential influence on the target variable. For instance, the relationship between the number of rooms and the rent price should be examined in detail.

The exploratory data analysis will include various graphical representations such as histograms, scatter plots, and boxplots to visualize the relationships between the variables.

madrid$Price<-as.numeric(gsub("\\.","",madrid$Price))
madrid<-madrid%>%
    filter(Zona!="Capital")
madrid <- madrid %>%
  mutate(Zona = ifelse(Zona %in% c("Barrio de Salamanca", "Salamanca"), "Salamanca", Zona)) %>%
  mutate(Zona = ifelse(Zona %in% c("Blas", "San Blas"), "San Blas - Canillejas", Zona)) %>%
  mutate(Zona = ifelse(Zona %in% c("Vallecas","Villa de Vallecas"), "Villa de Vallecas", Zona)) %>%
  mutate(Zona = ifelse(Zona %in% c("Lineal", "Ciudad Lineal"), "Ciudad Lineal", Zona)) %>%
  mutate(Zona = ifelse(Zona %in% c("Moncloa - Aravaca", "Aravaca"), "Moncloa - Aravaca", Zona)) %>%
  filter(Bathrooms < 10) %>%
  filter(Square.Meters > 10) %>%
  mutate(Zona = as.factor(Zona))

madrid<-na.omit(madrid)  
summary(madrid)
##      Price       Square.Meters         Rooms          Bathrooms    
##  Min.   :  450   Min.   :  11.00   Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 1400   1st Qu.:  55.00   1st Qu.: 1.000   1st Qu.:1.000  
##  Median : 2100   Median :  75.00   Median : 2.000   Median :1.000  
##  Mean   : 2575   Mean   :  99.17   Mean   : 2.136   Mean   :1.644  
##  3rd Qu.: 3172   3rd Qu.: 114.00   3rd Qu.: 3.000   3rd Qu.:2.000  
##  Max.   :37000   Max.   :8300.00   Max.   :10.000   Max.   :8.000  
##                                                                    
##         Zona     
##  Centro   :1046  
##  Salamanca: 587  
##  Chamberí : 532  
##  Tetuán   : 278  
##  Chamartín: 252  
##  Retiro   : 236  
##  (Other)  :1123
glimpse(madrid)
## Rows: 4,054
## Columns: 5
## $ Price         <dbl> 990, 1400, 1165, 1800, 1110, 1206, 11000, 4500, 1250, 18…
## $ Square.Meters <int> 68, 71, 55, 30, 390, 250, 82, 125, 90, 82, 152, 70, 42, …
## $ Rooms         <int> 2, 1, 3, 1, 3, 3, 2, 2, 3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 2,…
## $ Bathrooms     <int> 2, 1, 1, 1, 4, 4, 2, 2, 1, 2, 3, 1, 1, 2, 1, 2, 1, 2, 1,…
## $ Zona          <fct> Tetuán, Salamanca, Carabanchel, Centro, Carabanchel, Sal…

First things first: there are some very high values in Price and Square.Meters that look like outliers.

ggplot(madrid, aes(x = Price)) +
  geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Price",
       x = "Price",
       y = "Frequency") +
  theme_classic()

ggplot(madrid, aes(x = Square.Meters)) +
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Square.Meters",
       x = "Square.Meters",
       y = "Frequency") +
  theme_classic()

Both histograms are highly asymmetric, especially Square.Meters, indicating the presence of large outliers.

ggplot(madrid, aes(x = `Square.Meters`, y = Price,color = Zona)) +
  geom_point() +
  labs(title = "Price vs. Square Meters",
       x = "Square.Meters",
       y = "Price") +
  theme_classic()

Here we can clearly see two big outliers for Square.Meters and three for the Price variable.

Now let's remove the three largest values in Price.

top_prices_indices <- order(madrid$Price, decreasing = TRUE)[1:3]
madrid$Price[top_prices_indices]
## [1] 37000 25000 25000
madrid <- madrid[-top_prices_indices, ]

These prices are extreme, so it is better to remove them; otherwise they could distort the regression models.

The same procedure is applied to the three largest values of Square.Meters.

top_prices_indices2 <- order(madrid$Square.Meters, decreasing = TRUE)[1:3]
madrid$Square.Meters[top_prices_indices2]
## [1] 8300 7575 1308
madrid <- madrid[-top_prices_indices2, ]
madrid_subset <- madrid[, !names(madrid) %in% "Zona"]
plot(madrid_subset)

There is some kind of linear relationship between some variables, such as Rooms and Square.Meters, and also Bathrooms and Square.Meters. This is not really surprising: in general, an apartment with many rooms also needs more space. These relationships between predictors can be a problem for linear regression models, so it is important to check for collinearity when using that type of model.

This plot shows the relationship between Price and some variables in the data set. It is clear that there is a linear trend, and also many outliers. The data points show a lot of variance; to reduce this variability and improve model performance, a log transformation will be applied to the Price column. After the transformation, the data looks like this.

Now the data is less scattered and tighter. The trend also looks somewhat different from the previous plot: one could say there is a quadratic relationship between Price and Square.Meters. The interpretation would be that prices increase with surface area, but only up to a point, beyond which the marginal increase in price becomes smaller.
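The plotting code for these scatter plots was not echoed; a minimal sketch of how the log-scale version could be drawn, assuming the madrid data frame built above:

```r
library(ggplot2)

# Sketch of the log-Price scatter plot (assumes the madrid data frame from above)
ggplot(madrid, aes(x = Square.Meters, y = log(Price), color = Zona)) +
  geom_point(alpha = 0.6) +
  labs(title = "log(Price) vs. Square Meters",
       x = "Square.Meters",
       y = "log(Price)") +
  theme_classic()
```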

In the case of the variable Zona we have 22 districts of the city of Madrid. Here's a map that shows the number of apartments in each district.

## Reading layer `Distritos' from data source 
##   `/Users/aa/Downloads/Distritos/Distritos.shp' using driver `ESRI Shapefile'
## Simple feature collection with 21 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 424753.5 ymin: 4462566 xmax: 456033.3 ymax: 4499365
## Projected CRS: ETRS89 / UTM zone 30N
##  [1] Tetuán                Salamanca             Carabanchel          
##  [4] Centro                Retiro                Arganzuela           
##  [7] Hortaleza             Chamartín             Pardo                
## [10] Moncloa - Aravaca     Chamberí              Puente de Vallecas   
## [13] Vicálvaro             Ciudad Lineal         San Blas - Canillejas
## [16] Villa de Vallecas     Moratalaz             Usera                
## [19] Villaverde            Latina                Fuencarral - El Pardo
## [22] Barajas              
## 22 Levels: Arganzuela Barajas Carabanchel Centro Chamartín ... Villaverde

Frequency Table of Zona

Zona                     Freq
Arganzuela                175
Barajas                    21
Carabanchel                93
Centro                   1045
Chamartín                 252
Chamberí                  530
Ciudad Lineal             111
Fuencarral - El Pardo      47
Hortaleza                  78
Latina                     81
Moncloa - Aravaca         156
Moratalaz                   8
Pardo                      35
Puente de Vallecas         19
Retiro                    235
Salamanca                 586
San Blas - Canillejas      73
Tetuán                    278
Usera                      68
Vicálvaro                  21
Villa de Vallecas          73
Villaverde                 63

A good portion of the apartments are located in the city center and nearby areas; on the other hand, cheaper neighborhoods are underrepresented in the data set. This can explain why the average price is so high (2570.575).

*Note: I think Fotocasa.com has changed the way neighborhood information is displayed; instead of the neighborhood name it now shows the street name. This could be great for doing spatial analysis.

4 Modeling the data

This analysis is performed from a machine learning perspective. This means that, first, the data will be split into two groups, train and test: the train part is used to fit the model parameters, and the test part to evaluate performance. Secondly, the aim of the modelling is to maximize predictive accuracy, which means we will be less strict about model assumptions.

set.seed(1)
nobs <- nrow(madrid)
itrain <- sample(nobs, 0.8 * nobs)
train <- madrid[itrain, ]
test <- madrid[-itrain, ]
train$Price <- log(train$Price)
test$Price <- log(test$Price)
dim(train)
## [1] 3238    5
dim(test)
## [1] 810   5

4.1 Linear Regression

Linear regression is a popular and straightforward model that is easy to apply and provides a clear understanding of how variables interact with each other. The main objective of linear regression is to find a line that minimizes the following equation.

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] Where: \(n\) is the number of observations, \(y_i\) is the actual value, \(\hat{y}_i\) is the predicted value.

In simpler terms, the goal is to identify a line in the variable space that minimizes the differences between the observed values (e.g., actual prices) and the predicted values generated by the model.
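As a quick numeric illustration of the MSE criterion (the prices and predictions below are invented for the example, not taken from the Madrid data):

```r
# Toy illustration of the MSE criterion
y     <- c(990, 1400, 1165, 1800)   # observed prices (invented for the example)
y_hat <- c(1000, 1350, 1200, 1750)  # hypothetical model predictions

mse <- mean((y - y_hat)^2)
mse
## [1] 1581.25
```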

While linear regression is a great starting point due to its simplicity and interpretability, it is not always the most flexible model for handling diverse data. This limitation arises from the assumptions it makes:

1. Linearity: The relationship between the predictors and the response variable is assumed to be linear. If the true relationship is nonlinear, the model may struggle to fit the data accurately. In this case this is not an issue, since the data seems to show linear relationships, as noted before.

2. Independence: The observations are assumed to be independent of each other.

3. Homoscedasticity: The variance of the residuals (errors) should remain constant across all levels of the predictors.

4. Normality of Residuals: The residuals are assumed to follow a normal distribution.

5. No Multicollinearity: The predictors should not be highly correlated with each other, as multicollinearity can make the model unstable and affect the interpretation of coefficients.

Violating these assumptions can lead to biased estimations or poor predictive performance.

Let's fit the model.

linear_model<-lm(Price~Square.Meters+Rooms+Bathrooms+Zona,data=train)
summary(linear_model)
## 
## Call:
## lm(formula = Price ~ Square.Meters + Rooms + Bathrooms + Zona, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19928 -0.20420 -0.01267  0.21600  1.52373 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6.8953059  0.0302371 228.041  < 2e-16 ***
## Square.Meters              0.0029653  0.0001584  18.724  < 2e-16 ***
## Rooms                      0.0299308  0.0082933   3.609 0.000312 ***
## Bathrooms                  0.1938949  0.0122566  15.820  < 2e-16 ***
## ZonaBarajas               -0.2502175  0.0880229  -2.843 0.004502 ** 
## ZonaCarabanchel           -0.3197336  0.0483269  -6.616 4.31e-11 ***
## ZonaCentro                 0.1964012  0.0302205   6.499 9.34e-11 ***
## ZonaChamartín              0.1453546  0.0364915   3.983 6.95e-05 ***
## ZonaChamberí               0.2900938  0.0324892   8.929  < 2e-16 ***
## ZonaCiudad Lineal         -0.1459336  0.0450800  -3.237 0.001219 ** 
## ZonaFuencarral - El Pardo -0.2038817  0.0604059  -3.375 0.000746 ***
## ZonaHortaleza             -0.1411821  0.0500151  -2.823 0.004790 ** 
## ZonaLatina                -0.2258216  0.0476419  -4.740 2.23e-06 ***
## ZonaMoncloa - Aravaca     -0.0390976  0.0411054  -0.951 0.341598    
## ZonaMoratalaz             -0.3797751  0.1519834  -2.499 0.012511 *  
## ZonaPardo                 -0.0662384  0.0670824  -0.987 0.323512    
## ZonaPuente de Vallecas    -0.4010089  0.0906430  -4.424 1.00e-05 ***
## ZonaRetiro                 0.1958101  0.0367823   5.323 1.09e-07 ***
## ZonaSalamanca              0.3424533  0.0320946  10.670  < 2e-16 ***
## ZonaSan Blas - Canillejas -0.3472714  0.0527358  -6.585 5.29e-11 ***
## ZonaTetuán                 0.0252028  0.0361668   0.697 0.485947    
## ZonaUsera                 -0.3671235  0.0523111  -7.018 2.73e-12 ***
## ZonaVicálvaro             -0.2965868  0.0835456  -3.550 0.000391 ***
## ZonaVilla de Vallecas     -0.3394939  0.0537508  -6.316 3.05e-10 ***
## ZonaVillaverde            -0.3965639  0.0545057  -7.276 4.31e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3339 on 3213 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6499 
## F-statistic: 251.4 on 24 and 3213 DF,  p-value: < 2.2e-16

Right off the bat we get a fairly solid model in terms of variable significance. A few areas are not significant, such as Moncloa - Aravaca, Pardo and Tetuán. The R-squared is not too high, but with a few adjustments we can increase it.

Now it's time to check the assumptions made by the model. First, let's analyze the residuals to check whether they are normally distributed.

res<-linear_model$residuals
qqPlot(res)

## 3733    6 
## 1356 2715

The data doesn't look normal in the Q-Q plot, so let's run a Shapiro-Wilk test to confirm it.

shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98813, p-value = 7.854e-16

The residuals are clearly not normally distributed. Depending on the goal of the analysis this could be an issue. If the goal is prediction, it is not the end of the world and we can still trust the prediction results; but if the goal is inference, this is problematic, since we won't be able to trust the model's p-values and confidence intervals.

Let's test for multicollinearity.

madrid_subset <- madrid[, !names(madrid) %in% "Zona"]
mat_cor<-cor(madrid_subset)
corrplot(mat_cor,method = "ellipse")

car::vif(linear_model)
##                   GVIF Df GVIF^(1/(2*Df))
## Square.Meters 3.156749  1        1.776724
## Rooms         2.195207  1        1.481623
## Bathrooms     3.345565  1        1.829089
## Zona          1.199243 21        1.004335

There doesn't seem to be much multicollinearity between the variables in the fitted model. Now it's time to use the stepwise function, which performs variable selection (here backward/forward with the BIC criterion), dropping predictors that do not improve the model.

final_linear_model<-stepwise(linear_model)
## 
## Direction:  backward/forward
## Criterion:  BIC 
## 
## Start:  AIC=-6926.44
## Price ~ Square.Meters + Rooms + Bathrooms + Zona
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       358.24 -6926.4
## - Rooms          1     1.452 359.69 -6921.4
## - Bathrooms      1    27.903 386.14 -6691.7
## - Square.Meters  1    39.088 397.33 -6599.2
## - Zona          21   136.303 494.54 -6052.1

The final model keeps all the variables: Square.Meters, Bathrooms, Rooms and Zona.

Looking at the relationships between the variables, I suggest adding the interaction of Square.Meters, Bathrooms and Rooms.

model_with_interaction<-lm(Price~(Square.Meters*Bathrooms*Rooms)+Zona,data=train)
summary(model_with_interaction)
## 
## Call:
## lm(formula = Price ~ (Square.Meters * Bathrooms * Rooms) + Zona, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78910 -0.19632 -0.01664  0.18777  1.46084 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.403e+00  4.551e-02 140.690  < 2e-16 ***
## Square.Meters                  8.109e-03  5.282e-04  15.352  < 2e-16 ***
## Bathrooms                      3.985e-01  2.440e-02  16.334  < 2e-16 ***
## Rooms                          1.228e-01  1.587e-02   7.735 1.37e-14 ***
## ZonaBarajas                   -2.770e-01  8.271e-02  -3.349 0.000820 ***
## ZonaCarabanchel               -2.963e-01  4.549e-02  -6.513 8.49e-11 ***
## ZonaCentro                     1.966e-01  2.843e-02   6.914 5.64e-12 ***
## ZonaChamartín                  1.434e-01  3.433e-02   4.177 3.03e-05 ***
## ZonaChamberí                   2.584e-01  3.059e-02   8.448  < 2e-16 ***
## ZonaCiudad Lineal             -1.460e-01  4.237e-02  -3.445 0.000579 ***
## ZonaFuencarral - El Pardo     -2.589e-01  5.682e-02  -4.557 5.39e-06 ***
## ZonaHortaleza                 -1.612e-01  4.703e-02  -3.428 0.000617 ***
## ZonaLatina                    -1.977e-01  4.481e-02  -4.413 1.05e-05 ***
## ZonaMoncloa - Aravaca          3.918e-03  3.869e-02   0.101 0.919346    
## ZonaMoratalaz                 -3.811e-01  1.428e-01  -2.668 0.007665 ** 
## ZonaPardo                     -2.928e-02  6.314e-02  -0.464 0.642814    
## ZonaPuente de Vallecas        -4.005e-01  8.517e-02  -4.703 2.68e-06 ***
## ZonaRetiro                     1.800e-01  3.458e-02   5.206 2.06e-07 ***
## ZonaSalamanca                  2.942e-01  3.027e-02   9.720  < 2e-16 ***
## ZonaSan Blas - Canillejas     -3.558e-01  4.958e-02  -7.176 8.86e-13 ***
## ZonaTetuán                     9.939e-03  3.400e-02   0.292 0.770039    
## ZonaUsera                     -3.460e-01  4.919e-02  -7.034 2.45e-12 ***
## ZonaVicálvaro                 -3.526e-01  7.856e-02  -4.488 7.42e-06 ***
## ZonaVilla de Vallecas         -3.027e-01  5.060e-02  -5.982 2.45e-09 ***
## ZonaVillaverde                -3.978e-01  5.132e-02  -7.751 1.21e-14 ***
## Square.Meters:Bathrooms       -1.110e-03  1.660e-04  -6.686 2.70e-11 ***
## Square.Meters:Rooms           -8.256e-04  1.408e-04  -5.863 5.00e-09 ***
## Bathrooms:Rooms               -4.474e-02  5.733e-03  -7.804 8.03e-15 ***
## Square.Meters:Bathrooms:Rooms  1.545e-04  3.352e-05   4.611 4.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3137 on 3209 degrees of freedom
## Multiple R-squared:  0.6936, Adjusted R-squared:  0.691 
## F-statistic: 259.5 on 28 and 3209 DF,  p-value: < 2.2e-16
car::vif(model_with_interaction, type = 'predictor')
## GVIFs computed for predictors
##                  GVIF Df GVIF^(1/(2*Df))           Interacts With
## Square.Meters 1.31666  7        1.019844         Bathrooms, Rooms
## Bathrooms     1.31666  7        1.019844     Square.Meters, Rooms
## Rooms         1.31666  7        1.019844 Square.Meters, Bathrooms
## Zona          1.31666 21        1.006571                     --  
##                              Other Predictors
## Square.Meters                            Zona
## Bathrooms                                Zona
## Rooms                                    Zona
## Zona          Square.Meters, Bathrooms, Rooms

No problems with multicollinearity.

To make sure this model is better than the previous one, we can use an analysis of variance (ANOVA).

anova(final_linear_model,model_with_interaction)
## Analysis of Variance Table
## 
## Model 1: Price ~ Square.Meters + Rooms + Bathrooms + Zona
## Model 2: Price ~ (Square.Meters * Bathrooms * Rooms) + Zona
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3213 358.24                                  
## 2   3209 315.81  4    42.425 107.77 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small, indicating that the second model explains the data better.

To measure the accuracy of the models we will use the RMSE, the MAE and the \(R^2\).

accuracy <- function(pred, obs, na.rm = FALSE, 
                     tol = sqrt(.Machine$double.eps), 
                     row_name = NULL, 
                     results_df = NULL) {
  
  # Calculate errors
  err <- obs - pred 
  
  if(na.rm) {
    is.a <- !is.na(err)
    err <- err[is.a]
    obs <- obs[is.a]
  }
  
  # Calculate metrics
  metrics <- c(
    rmse = sqrt(mean(err^2)), 
    mae = mean(abs(err)), 
    r.squared = 1 - sum(err^2) / sum((obs - mean(obs))^2)
  )
  
  # Convert metrics to a data frame
  metrics_df <- as.data.frame(t(metrics))
  
  # Add row name if provided
  if (!is.null(row_name)) {
    rownames(metrics_df) <- row_name
  }
  
  # Append metrics to the results data frame if provided
  if (!is.null(results_df)) {
    results_df <- rbind(results_df, metrics_df)
  } else {
    results_df <- metrics_df
  }
  
  return(results_df)
}
obs<-test$Price
lm_prediction<-predict(model_with_interaction,newdata=test)
ab<-accuracy(lm_prediction,test$Price,na.rm = T,row_name = "Linear Regression" )
pred.plot(lm_prediction, obs, xlab = "Predicción", ylab = "Observado")

4.2 Ridge Regression

\[ \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \] Where \(\lambda\) is the penalty parameter. This type of model tends to keep all the variables while shrinking the size of their coefficients.

x_train<-(model.matrix(Price ~ ., data = train))[, -1]

y_train<- train$Price

x_test<-(model.matrix(Price ~ ., data = test))[, -1]
y_test<-test$Price
set.seed(1)
cv.ridge <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.ridge)

cv.ridge$lambda.1se
## [1] 0.0987712

The selected \(\lambda\) (lambda.1se) is 0.0987712.

coef(cv.ridge)
## 25 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                6.993662420
## Square.Meters              0.002536255
## Rooms                      0.047488132
## Bathrooms                  0.179370720
## ZonaBarajas               -0.275217796
## ZonaCarabanchel           -0.346419613
## ZonaCentro                 0.107474561
## ZonaChamartín              0.068648020
## ZonaChamberí               0.194867586
## ZonaCiudad Lineal         -0.188903310
## ZonaFuencarral - El Pardo -0.239033638
## ZonaHortaleza             -0.179712904
## ZonaLatina                -0.269942836
## ZonaMoncloa - Aravaca     -0.084214931
## ZonaMoratalaz             -0.403232153
## ZonaPardo                 -0.119121625
## ZonaPuente de Vallecas    -0.415076422
## ZonaRetiro                 0.110994306
## ZonaSalamanca              0.244993851
## ZonaSan Blas - Canillejas -0.367939788
## ZonaTetuán                -0.043127245
## ZonaUsera                 -0.389398075
## ZonaVicálvaro             -0.317010742
## ZonaVilla de Vallecas     -0.369420984
## ZonaVillaverde            -0.415570046

As can be seen, none of the coefficients is particularly large, which is the expected effect of the Ridge shrinkage.

pred <- predict(cv.ridge, newx = x_test) # s = "lambda.1se"
ridge<-accuracy(pred, y_test,row_name = "Ridge regression");ridge
##                       rmse       mae r.squared
## Ridge regression 0.3353355 0.2679801 0.6316163
performance<-rbind(ab,ridge)

4.3 Lasso Regression

\[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \] LASSO can shrink some coefficients to exactly zero. This effectively removes certain features from the model, making it useful for feature selection.

cv.lasso<- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.lasso)

cv.lasso$lambda.1se
## [1] 0.01808242
coef(cv.lasso)
## 25 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                6.943187816
## Square.Meters              0.002905329
## Rooms                      .          
## Bathrooms                  0.214492280
## ZonaBarajas               -0.005263230
## ZonaCarabanchel           -0.202093999
## ZonaCentro                 0.145768620
## ZonaChamartín              0.067491485
## ZonaChamberí               0.230704034
## ZonaCiudad Lineal         -0.045164820
## ZonaFuencarral - El Pardo -0.039193100
## ZonaHortaleza             -0.019478162
## ZonaLatina                -0.114153467
## ZonaMoncloa - Aravaca      .          
## ZonaMoratalaz              .          
## ZonaPardo                  .          
## ZonaPuente de Vallecas    -0.143026455
## ZonaRetiro                 0.117361005
## ZonaSalamanca              0.286424246
## ZonaSan Blas - Canillejas -0.209006690
## ZonaTetuán                 .          
## ZonaUsera                 -0.239168580
## ZonaVicálvaro             -0.062302102
## ZonaVilla de Vallecas     -0.205297458
## ZonaVillaverde            -0.257253325

The coefficients shrunk to zero include exactly the areas that were non-significant in the linear regression model (Moncloa - Aravaca, Pardo and Tetuán), along with Moratalaz and Rooms, so the final model has 19 predictors.

pred <- predict(cv.lasso, newx = x_test)
lasso<-accuracy(pred, y_test,row_name = "Lasso");lasso
##            rmse       mae r.squared
## Lasso 0.3423582 0.2749201 0.6160249
performance<-rbind(performance,lasso)

4.4 Elastic Net

Elastic Net combines the LASSO and Ridge penalties, with a mixing parameter \(\alpha\) controlling the balance between the two.
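Using notation consistent with the previous two sections (glmnet internally scales the loss term by a constant), the combined criterion can be written as:

\[ \hat{\beta}^{\text{enet}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right] \right\} \]

so \(\alpha = 1\) recovers the LASSO penalty and \(\alpha = 0\) the Ridge penalty.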

caret.glmnet <- train(x_train, y_train, method = "glmnet", 
                      preProc = c("zv", "center", "scale"), tuneLength = 10,
                      trControl = trainControl(method = "cv", number = 10))
caret.glmnet
## glmnet 
## 
## 3238 samples
##   24 predictor
## 
## Pre-processing: centered (24), scaled (24) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2914, 2914, 2915, 2915, 2915, 2913, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.1    0.0004158078  0.3350818  0.6518788  0.2599257
##   0.1    0.0009605700  0.3350818  0.6518788  0.2599257
##   0.1    0.0022190412  0.3350818  0.6518788  0.2599257
##   0.1    0.0051262730  0.3350615  0.6519340  0.2601036
##   0.1    0.0118423555  0.3350957  0.6520260  0.2607078
##   0.1    0.0273573772  0.3355603  0.6518362  0.2622981
##   0.1    0.0631990896  0.3380792  0.6500819  0.2670688
##   0.1    0.1459980940  0.3488168  0.6405680  0.2809946
##   0.1    0.3372745330  0.3811772  0.6037253  0.3137708
##   0.2    0.0004158078  0.3350877  0.6518664  0.2599069
##   0.2    0.0009605700  0.3350877  0.6518664  0.2599069
##   0.2    0.0022190412  0.3350877  0.6518676  0.2599094
##   0.2    0.0051262730  0.3350923  0.6519408  0.2602398
##   0.2    0.0118423555  0.3352439  0.6519592  0.2610456
##   0.2    0.0273573772  0.3363540  0.6510874  0.2635569
##   0.2    0.0631990896  0.3417516  0.6455568  0.2717741
##   0.2    0.1459980940  0.3624660  0.6180718  0.2951903
##   0.2    0.3372745330  0.4080538  0.5499976  0.3370574
##   0.3    0.0004158078  0.3350978  0.6518545  0.2598920
##   0.3    0.0009605700  0.3350978  0.6518545  0.2598920
##   0.3    0.0022190412  0.3350943  0.6518750  0.2599532
##   0.3    0.0051262730  0.3351390  0.6519206  0.2603825
##   0.3    0.0118423555  0.3354885  0.6517211  0.2614895
##   0.3    0.0273573772  0.3374738  0.6497550  0.2652097
##   0.3    0.0631990896  0.3468303  0.6376983  0.2776144
##   0.3    0.1459980940  0.3761497  0.5922956  0.3081332
##   0.3    0.3372745330  0.4219972  0.5347495  0.3493966
##   0.4    0.0004158078  0.3351007  0.6518514  0.2598784
##   0.4    0.0009605700  0.3351007  0.6518514  0.2598784
##   0.4    0.0022190412  0.3351092  0.6518703  0.2600126
##   0.4    0.0051262730  0.3352028  0.6518714  0.2605330
##   0.4    0.0118423555  0.3358049  0.6513631  0.2620334
##   0.4    0.0273573772  0.3389111  0.6477980  0.2671329
##   0.4    0.0631990896  0.3530272  0.6267536  0.2842229
##   0.4    0.1459980940  0.3898413  0.5615312  0.3200368
##   0.4    0.3372745330  0.4332329  0.5269656  0.3590144
##   0.5    0.0004158078  0.3351031  0.6518458  0.2598562
##   0.5    0.0009605700  0.3351031  0.6518458  0.2598562
##   0.5    0.0022190412  0.3351267  0.6518618  0.2600746
##   0.5    0.0051262730  0.3352824  0.6517952  0.2607027
##   0.5    0.0118423555  0.3361696  0.6509316  0.2626429
##   0.5    0.0273573772  0.3407377  0.6449880  0.2694845
##   0.5    0.0631990896  0.3589289  0.6159205  0.2903502
##   0.5    0.1459980940  0.3973219  0.5474109  0.3267314
##   0.5    0.3372745330  0.4441066  0.5267091  0.3680285
##   0.6    0.0004158078  0.3351091  0.6518409  0.2598605
##   0.6    0.0009605700  0.3351091  0.6518409  0.2598605
##   0.6    0.0022190412  0.3351472  0.6518482  0.2601377
##   0.6    0.0051262730  0.3353809  0.6516874  0.2608908
##   0.6    0.0118423555  0.3365810  0.6504194  0.2632992
##   0.6    0.0273573772  0.3428302  0.6415898  0.2720196
##   0.6    0.0631990896  0.3648122  0.6044508  0.2962481
##   0.6    0.1459980940  0.4031304  0.5381556  0.3320547
##   0.6    0.3372745330  0.4567186  0.5262713  0.3782436
##   0.7    0.0004158078  0.3351090  0.6518383  0.2598508
##   0.7    0.0009605700  0.3351107  0.6518381  0.2598649
##   0.7    0.0022190412  0.3351708  0.6518292  0.2602015
##   0.7    0.0051262730  0.3354885  0.6515661  0.2610911
##   0.7    0.0118423555  0.3370644  0.6497728  0.2640249
##   0.7    0.0273573772  0.3451076  0.6377246  0.2746755
##   0.7    0.0631990896  0.3708486  0.5919455  0.3019369
##   0.7    0.1459980940  0.4084697  0.5304748  0.3368140
##   0.7    0.3372745330  0.4712417  0.5254267  0.3899085
##   0.8    0.0004158078  0.3351104  0.6518347  0.2598421
##   0.8    0.0009605700  0.3351170  0.6518355  0.2598896
##   0.8    0.0022190412  0.3351975  0.6518057  0.2602672
##   0.8    0.0051262730  0.3356049  0.6514339  0.2613025
##   0.8    0.0118423555  0.3376250  0.6489749  0.2648216
##   0.8    0.0273573772  0.3474947  0.6335652  0.2774138
##   0.8    0.0631990896  0.3773437  0.5774159  0.3077672
##   0.8    0.1459980940  0.4129159  0.5260807  0.3408122
##   0.8    0.3372745330  0.4878927  0.5235124  0.4029263
##   0.9    0.0004158078  0.3351115  0.6518320  0.2598350
##   0.9    0.0009605700  0.3351236  0.6518323  0.2599146
##   0.9    0.0022190412  0.3352268  0.6517776  0.2603341
##   0.9    0.0051262730  0.3357272  0.6512949  0.2615302
##   0.9    0.0118423555  0.3382817  0.6479732  0.2657017
##   0.9    0.0273573772  0.3499519  0.6291315  0.2801228
##   0.9    0.0631990896  0.3835704  0.5628340  0.3132165
##   0.9    0.1459980940  0.4165865  0.5256772  0.3441245
##   0.9    0.3372745330  0.5067922  0.5179439  0.4174473
##   1.0    0.0004158078  0.3351126  0.6518292  0.2598296
##   1.0    0.0009605700  0.3351311  0.6518276  0.2599406
##   1.0    0.0022190412  0.3352594  0.6517444  0.2604037
##   1.0    0.0051262730  0.3358575  0.6511437  0.2617711
##   1.0    0.0118423555  0.3390209  0.6467949  0.2666697
##   1.0    0.0273573772  0.3525407  0.6242712  0.2828532
##   1.0    0.0631990896  0.3878177  0.5534166  0.3170117
##   1.0    0.1459980940  0.4206505  0.5250983  0.3476921
##   1.0    0.3372745330  0.5279847  0.4987109  0.4334937
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.005126273.

Here I tried a grid of different parameter combinations to see whether better results could be obtained.

ggplot(caret.glmnet, highlight = TRUE)
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 9 values. Consider specifying shapes manually if you need
##   that many have them.
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).

pred <- predict(caret.glmnet, newdata = x_test)
elastic<-accuracy(pred, y_test,row_name = "Elastic Net")
performance<-rbind(performance,elastic)
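As a quick follow-up (a sketch, assuming the `caret.glmnet` object fitted above), the cross-validated tuning choice and the coefficients surviving the penalty can be inspected directly from the underlying glmnet fit:

```r
# Best alpha/lambda pair selected by cross-validation
caret.glmnet$bestTune

# Coefficients of the final glmnet fit at the selected lambda;
# coefficients shrunk exactly to zero were dropped by the penalty
coef(caret.glmnet$finalModel, s = caret.glmnet$bestTune$lambda)
```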

4.5 Random Forest

rf<-randomForest(Price~.,data=train);rf
## 
## Call:
##  randomForest(formula = Price ~ ., data = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.1019787
##                     % Var explained: 67.97
plot(rf, main = "Random Forest")

The error stabilizes after roughly 200 trees.

varImpPlot(rf)

The random forest lets us rank the variables by importance. Square.Meters is the most important, followed by Bathrooms; these two variables account for most of the model's explanatory power. Zona (neighborhood) comes next, and Rooms appears to be the least important, possibly because the information it carries is already contained in Square.Meters.
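The fit above keeps randomForest's defaults (with four predictors, `mtry` defaults to 1 for regression). As a sketch, not run in the original analysis, `mtry` could be tuned with caret using the same 5-fold cross-validation used elsewhere in this document:

```r
# Tune mtry (number of variables tried at each split) over its
# full range for this data set; caret picks the value with the
# lowest cross-validated RMSE
caret.rf <- train(Price ~ ., data = train, method = "rf",
                  tuneGrid = data.frame(mtry = 1:4),
                  trControl = trainControl(method = "cv", number = 5))
caret.rf$bestTune
```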

pdp1 <- partial(rf, "Square.Meters")
p1 <- plotPartial(pdp1)
pdp2 <- partial(rf, c("Bathrooms"))
p2 <- plotPartial(pdp2)
pdp3 <- partial(rf, c("Rooms"))
p3 <- plotPartial(pdp3)
grid.arrange(p1, p2,p3, ncol = 3)

The partial dependence plots let us see how each variable relates to the target. Increasing the surface area of the apartments raises the price up to a point where the effect flattens, which could be caused by a lack of data in that range. Bathrooms show the same trend. Increasing the number of rooms also raises the price, but beyond 7 rooms the price drops; again, this could be due to scarce data (I believe there is only one apartment with 10 rooms) or it could be a real underlying trend.

pred <- predict(rf, newdata = test)
rf<-accuracy(pred, obs,row_name = "Random Forest");rf
##                    rmse       mae r.squared
## Random Forest 0.3151583 0.2555943 0.6746139
performance<-rbind(performance,rf)

4.6 Generalized Additive Models (GAM)

gam <- gam(
  Price ~ s(Square.Meters,Bathrooms,Rooms)+ Zona, 
  data = train, 
  family = gaussian()
)
summary(gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Price ~ s(Square.Meters, Bathrooms, Rooms) + Zona
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.55815    0.02584 292.542  < 2e-16 ***
## ZonaBarajas               -0.27433    0.08159  -3.362 0.000783 ***
## ZonaCarabanchel           -0.28326    0.04507  -6.285 3.73e-10 ***
## ZonaCentro                 0.20928    0.02804   7.463 1.09e-13 ***
## ZonaChamartín              0.14544    0.03390   4.291 1.83e-05 ***
## ZonaChamberí               0.26413    0.03027   8.725  < 2e-16 ***
## ZonaCiudad Lineal         -0.13789    0.04193  -3.289 0.001017 ** 
## ZonaFuencarral - El Pardo -0.24541    0.05607  -4.377 1.24e-05 ***
## ZonaHortaleza             -0.16130    0.04647  -3.471 0.000525 ***
## ZonaLatina                -0.18936    0.04415  -4.289 1.85e-05 ***
## ZonaMoncloa - Aravaca      0.01431    0.03830   0.374 0.708644    
## ZonaMoratalaz             -0.39818    0.14044  -2.835 0.004608 ** 
## ZonaPardo                 -0.05117    0.06235  -0.821 0.411876    
## ZonaPuente de Vallecas    -0.40069    0.08380  -4.782 1.82e-06 ***
## ZonaRetiro                 0.17702    0.03411   5.190 2.23e-07 ***
## ZonaSalamanca              0.29334    0.02987   9.820  < 2e-16 ***
## ZonaSan Blas - Canillejas -0.36123    0.04900  -7.373 2.12e-13 ***
## ZonaTetuán                 0.01693    0.03350   0.505 0.613256    
## ZonaUsera                 -0.33463    0.04876  -6.863 8.08e-12 ***
## ZonaVicálvaro             -0.34632    0.07737  -4.476 7.88e-06 ***
## ZonaVilla de Vallecas     -0.30178    0.04996  -6.041 1.71e-09 ***
## ZonaVillaverde            -0.40584    0.05098  -7.962 2.35e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                    edf Ref.df     F p-value    
## s(Square.Meters,Bathrooms,Rooms) 44.93   54.8 88.33  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.702   Deviance explained = 70.8%
## GCV = 0.096829  Scale est. = 0.094828  n = 3238
vis.gam(gam, view = c("Square.Meters", "Bathrooms"), plot.type = "persp", 
        theta = 30, phi = 30, 
        main = "3D Interaction of Square Meters and Bathrooms",
        color = "heat", zlab = "Price")

vis.gam(gam, view = c("Square.Meters", "Bathrooms"), plot.type = "contour", 
        color = "topo", main = "2D Contour Plot of Square Meters and Bathrooms")

These two plots show the joint effect of surface area and number of bathrooms on the predicted price; the interaction between them is clearly visible.
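A natural sanity check (a sketch, not part of the original results) is to compare the joint three-way smooth with a purely additive specification and let AIC arbitrate between them:

```r
# Alternative GAM with separate univariate smooths; Bathrooms and
# Rooms take few distinct values, so their basis dimension k is
# kept small. Lower AIC indicates the better fit/complexity trade-off.
gam.add <- gam(Price ~ s(Square.Meters) + s(Bathrooms, k = 5) +
                 s(Rooms, k = 5) + Zona,
               data = train, family = gaussian())
AIC(gam, gam.add)
```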

pred <- predict(gam, newdata = test)
g<-accuracy(pred, obs,row_name = "GAM");g
##          rmse       mae r.squared
## GAM 0.3114544 0.2464323 0.6822173
performance<-rbind(performance,g)
pred.plot(pred, obs, xlab = "Predicción", ylab = "Observado")

4.7 Boosting

trControl <- trainControl(method = "cv", number = 5)

caret.xgb <- train(Price ~ ., method = "xgbTree", data = train,
                   trControl = trControl, verbosity = 0)
caret.xgb
pred <- predict(caret.xgb, newdata = test)
xgb<-accuracy(pred, obs,row_name = "XGB");xgb
##          rmse       mae r.squared
## XGB 0.3029643 0.2410419 0.6993063
performance<-rbind(performance,xgb)

5 Results

                       rmse       mae  r.squared
Linear Regression  0.3080600  0.2458122  0.6891063
Ridge regression   0.3353355  0.2679801  0.6316163
Lasso              0.3423582  0.2749201  0.6160249
Elastic Net        0.3337781  0.2625992  0.6350301
Random Forest      0.3151583  0.2555943  0.6746139
GAM                0.3114544  0.2464323  0.6822173
XGB                0.3029643  0.2410419  0.6993063
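The table above can be rendered directly from the accumulated `performance` data frame (knitr is already loaded in section 2):

```r
# Render the performance comparison table; 7 digits match the
# printed values above
kable(performance, digits = 7)
```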

After testing different models, the results were quite similar. The XGBoost model had the highest R-squared, but the difference compared to the other models was very small. For this reason, the linear regression model seems like a better option because it is simple, easy to understand, and works efficiently.

In summary, the models were not able to fully explain the variations in the data. One of the main challenges was the high variability in the data, which made it difficult for the models to detect clear patterns. Another possible reason is the lack of important information, such as the condition or quality of the apartment, which can have a big effect on the rental price. Including this type of information in future models could help improve the predictions.